From bioinformatics to computational biology.

Author

  • J M Claverie
Abstract

It is quite ironic that the uncertainty about the number of human genes (28,000–120,000) (Ewing and Green 2000; Liang et al. 2000; Roest Crollius et al. 2000) appears to increase as the determination of the human genome sequence nears completion. I shall contend here that this paradox reveals deep epistemological problems, and that “bioinformatics” (a term coined in 1990 to define the use of computers in sequence analysis) is no longer developing in directions relevant to biology. After the pioneers who established the basic concepts of molecular sequence analysis (Fitch and Margoliash 1967; Needleman and Wunsch 1970; Chou and Fasman 1974), most computational biologists of my generation (the second one) embarked on their journey into the emerging discipline with the ambition to turn it into the bona fide theoretical branch of molecular biology. Having a physicist’s background, I suspect that many of us had the vision of establishing bioinformatics in a leadership role over experimental biology, similar to the supremacy that theoretical physics enjoys over experimental physics. Somewhere along the line, it seems that bioinformatics lost this ambition and became sidetracked onto what physicists would call a “phenomenological” pathway. Let us follow the example of particle physics for a little longer. There, theoretical research has two phases (which, in fact, run in parallel). In the first phase (so-called phenomenological), a large number of physical events are recorded in huge raw databases, classified into separate groups based on statistical regularities, and then utilized to identify the most recurrent objects. Optimal database design, fast classification/clustering algorithms, and data mining software are the main areas of development here. The level of knowledge gained from this phase is, for instance, that objects A and B often appear together except when C is around, or when parameter X is lower than a certain threshold; it is mostly statistical in nature.
The parallel with the current state of bioinformatics is clear. However, theoretical physics also has a subsequent, totally different phase, aiming at discovering the basic (few) rules (e.g., E = mc²) underlying the relationships between the objects and their individual properties, and thus finally explaining the statistical distributions of the events recorded in the databases. Once known, these rules considerably simplify the description of the database content and, more important, have predictive power: the realm of the theory may encompass objects or events that have not been observed previously. This part of the theoretical endeavor is entirely missing in current bioinformatics. As a consequence, we are still not able to agree on the number of human genes despite having the complete sequence of the human genome at hand. Identifying precisely the 5′ and 3′ boundaries of genes (the transcription unit) in metazoan genomes, as well as the correct sequences of the resulting mRNA (“exon parsing”), has been a major challenge of bioinformatics for years. Yet the performance of current programs is still totally insufficient for reliable automated annotation (Claverie 1997; Ashburner 2000). It is interesting to recapitulate quickly the research in this area to illustrate the essential limitation plaguing modern bioinformatics. Encoding a protein imposes a variety of constraints on nucleotide sequences, which do not apply to noncoding regions of the genome. These constraints induce statistical biases of various kinds, the most discriminant of which was soon recognized to be the distribution of six-nucleotide-long “words” or hexamers (Claverie and Bougueleret 1986; Fickett and Tung 1992). Initial gene parsing methods were then simply based on word frequency computation, eventually combined with the detection of splicing consensus motifs. The next generation of software implemented the same basic principles in a simulated neural network architecture (Uberbacher and Mural 1991).
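The word-frequency idea described above can be illustrated with a minimal sketch. It is not the method of any of the cited programs: the function names, the pseudocount, the log-odds scoring, and the two toy training sequences are all assumptions made here for illustration. The sketch counts overlapping hexamers in a "coding" and a "noncoding" training set and scores a query by its average log-odds per hexamer.

```python
import math
from collections import Counter

K = 6  # word length: hexamers, as in the text


def hexamer_counts(seqs):
    """Count all overlapping K-letter words in a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - K + 1):
            counts[s[i:i + K]] += 1
    return counts


def make_log_odds(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Return a function giving log(P(word | coding) / P(word | noncoding)).

    The pseudocount (an assumption of this sketch) keeps unseen words
    from producing a division by zero or log(0).
    """
    coding = hexamer_counts(coding_seqs)
    noncoding = hexamer_counts(noncoding_seqs)
    n_words = 4 ** K  # number of possible hexamers over A, C, G, T
    coding_total = sum(coding.values()) + pseudocount * n_words
    noncoding_total = sum(noncoding.values()) + pseudocount * n_words

    def score(word):
        p_coding = (coding[word] + pseudocount) / coding_total
        p_noncoding = (noncoding[word] + pseudocount) / noncoding_total
        return math.log(p_coding / p_noncoding)

    return score


def coding_score(seq, score):
    """Average log-odds over all hexamers of seq; > 0 leans 'coding'."""
    seq = seq.upper()
    words = [seq[i:i + K] for i in range(len(seq) - K + 1)]
    if not words:
        return 0.0
    return sum(score(w) for w in words) / len(words)


# Toy, made-up training data (hypothetical, far too small for real use).
toy_coding = ["ATGGCCGCCGCCGCCGCCGCCGCC"]
toy_noncoding = ["TTATATATATATATATATATAA"]
score = make_log_odds(toy_coding, toy_noncoding)

gc_rich = coding_score("GCCGCCGCCGCC", score)
at_rich = coding_score("ATATATATATAT", score)
```

With these toy inputs the GC-rich query scores above zero and the AT-rich query below zero. The score only reflects which training set a sequence statistically resembles, which is the "phenomenological" kind of knowledge the text describes.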
Finally, the last generation of software, based on hidden Markov models, added a further refinement by computing the likelihood of the predicted gene architectures (e.g., favoring human genes with an average of seven coding exons, each 150 nucleotides long) (Kulp et al. 1996; Burge and Karlin 1997). These ab initio methods are used in conjunction with a search for sequence similarity with previously characterized genes or expressed sequence tags (ESTs). Sadly, it is often claimed that matching cDNA back to genomic sequences is the best gene identification protocol, hence admitting that the best way to find genes is to look them up in a previously established catalog! Thus, the two main principles behind state-of-the-art gene prediction software are (1) common statistical regularities and (2) plain sequence similarity. From an ...

Article and publication are at www.genesdev.org/cgi/doi/10.1101/gad.155500.

Related articles

Design, Modeling and Computational Analysis of crRNA to Regulate MetastamiR-10b and MetastamiR-126 in Post-transcriptional Level by CRISPR-C2c2 (Cas13a) Technique

Introduction: Metastasis is one of the most important causes of mortality in cancer patients. Recent studies have shown the metastatic potential of a specific group of microRNAs called metastamirs. miR-126 has been shown to be correlated with colorectal liver metastasis. Also, overexpression of miR-10b has been reported in metastatic breast cancer. Therefore, downregulation of these miRNAs at tra...

The Importance of α-CT and Salt bridges in the Formation of Insulin and its Receptor Complex by Computational Simulation

The hormone insulin is an important part of the endocrine system. It contains two polypeptide chains and plays a pivotal role in regulating carbohydrate metabolism. Insulin receptors (IR) located on the cell surface interact with insulin to control the intake of glucose. Although several studies have tried to clarify the interaction between insulin and its receptor, the mechanism of this interaction r...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Computational prediction of miRNAs in Nipah virus genome reveals possible interaction with human genes involved in encephalitis

The current re-emergence of Nipah virus (NiV) in India has caused 11 deaths so far, and many patients have been kept in quarantine. A thorough study of previous outbreaks in Malaysia, Bangladesh and India shows cases with a high fatality rate due to acute encephalitis. Our work involves genome analysis of NiV for prediction of miRNAs and their target genes in humans in order to understand enc...

Journal title:
  • Genome Research

Volume 10, Issue 9

Pages -

Publication date: 2000